Abstract

We present a powerful general framework for designing data-dependent optimization
algorithms, building upon and unifying recent techniques in adaptive regularization,
optimistic gradient predictions, and problem-dependent randomization. We first
present a series of new regret guarantees that hold at any time and under very minimal
assumptions, and then show how different relaxations recover existing algorithms,
both basic as well as more recent sophisticated ones. Finally, we show how combining
adaptivity, optimism, and problem-dependent randomization can guide the design of
algorithms that benefit from more favorable guarantees than recent state-of-the-art
methods.


1 Introduction 

Online convex optimization algorithms represent key tools in modern machine learning. 
These are flexible algorithms used for solving a variety of optimization problems in
classification, regression, ranking and probabilistic inference. These algorithms typically process
one sample at a time with an update per iteration that is often computationally cheap and 
easy to implement. As a result, they can be substantially more efficient both in time and 
space than standard batch learning algorithms, which often have optimization costs that 
are prohibitive for very large data sets. 

In the standard scenario of online convex optimization (Zinkevich, 2003), at each round
$t = 1, 2, \ldots$, the learner selects a point $x_t$ out of a compact convex set $\mathcal{K}$, and incurs loss
$f_t(x_t)$, where $f_t$ is a convex function defined over $\mathcal{K}$. The learner's objective is to find an
algorithm $\mathcal{A}$ that minimizes the regret with respect to a fixed point $x^*$:

\[ \mathrm{Reg}_T(\mathcal{A}, x^*) = \sum_{t=1}^T f_t(x_t) - f_t(x^*), \]

that is, the difference between the learner's cumulative loss and the loss in hindsight incurred
by $x^*$, or with respect to the loss of the best $x^*$ in $\mathcal{K}$, $\mathrm{Reg}_T(\mathcal{A}) = \max_{x^* \in \mathcal{K}} \mathrm{Reg}_T(\mathcal{A}, x^*)$. We
will assume only that the learner has access to the gradient or an element of the sub-gradient
of the loss functions $f_t$, but that the loss functions $f_t$ can be arbitrarily singular and flat,
e.g. not necessarily strongly convex or strongly smooth. This is the most general setup of
convex optimization in the full information setting. It can be applied to standard convex
optimization and online learning tasks as well as to many optimization problems in machine
learning such as those of SVMs, logistic regression, and ridge regression. Favorable bounds
in online convex optimization can also be translated into strong learning guarantees in the
standard scenario of batch supervised learning using online-to-batch conversion guarantees
(Littlestone, 1989; Cesa-Bianchi et al., 2004; Mohri et al., 2012).
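
To make the regret definition concrete, the following minimal sketch evaluates it for a short sequence of linear losses on the interval $[-1, 1]$. The helper name `regret` and the toy data are assumptions made for this illustration only and do not come from the paper.

```python
import numpy as np

def regret(loss_fns, xs, x_star):
    """Cumulative loss of the played points minus that of a fixed comparator."""
    learner = sum(f(x) for f, x in zip(loss_fns, xs))
    comparator = sum(f(x_star) for f in loss_fns)
    return learner - comparator

# Toy example (hypothetical data): linear losses f_t(x) = g_t . x on [-1, 1].
gs = [np.array([1.0]), np.array([-0.5]), np.array([1.0])]
loss_fns = [lambda x, g=g: float(g @ x) for g in gs]
xs = [np.array([0.0]), np.array([-1.0]), np.array([0.5])]   # points played by the learner
x_star = np.array([-1.0])                                   # fixed comparator
toy_regret = regret(loss_fns, xs, x_star)
```

Here the learner's cumulative loss is 0 + 0.5 + 0.5 and the comparator's is -1 + 0.5 - 1, so the regret against this particular $x^*$ is 2.5.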

In the scenario of online convex optimization just presented, minimax optimal rates can
be achieved by standard algorithms such as online gradient descent (Zinkevich, 2003). However,
general minimax optimal rates may be too conservative. Recently, adaptive regularization
methods have been introduced for standard descent methods to achieve tighter data-dependent
regret bounds (see (Bartlett et al., 2007), (Duchi et al., 2010), (McMahan and Streeter,
2010), (McMahan, 2014), (Orabona et al., 2013)). Specifically, in the “AdaGrad” framework
of (Duchi et al., 2010), there exists a sequence of convex functions $\psi_t$ such that the update
$x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} \eta\, g_t^\top x + B_{\psi_t}(x, x_t)$ yields regret:

\[ \mathrm{Reg}_T(\mathcal{A}, x) \le \sqrt{2}\, \max_t \|x - x_t\|_\infty \sum_{i=1}^n \sqrt{\sum_{t=1}^T |g_{t,i}|^2}, \]

where $g_t \in \partial f_t(x_t)$ is an element of the subgradient of $f_t$ at $x_t$, $g_{1:T,i} = \sum_{t=1}^T g_{t,i}$, and $B_{\psi_t}$
is the Bregman divergence defined using the convex function $\psi_t$. This upper bound on the
regret has been shown to be within a factor $\sqrt{2}$ of the optimal a posteriori regret.

Note, however, that this upper bound on the regret can still be very large, even if the
functions $f_t$ admit some favorable properties (e.g. $f_t = f$, linear). This is because the
dependence is directly on the norm of the $g_t$s.

An alternative line of research has been investigated by a series of recent publications
that have analyzed online learning in “slowly-varying” scenarios (Hazan and Kale, 2009;
Chiang et al., 2012; Rakhlin and Sridharan, 2013; Chiang et al., 2013). In the framework
of (Rakhlin and Sridharan, 2013), if $\mathcal{R}$ is a self-concordant function, $\|\cdot\|_{\nabla^2 \mathcal{R}(x_t)}$ is the
semi-norm induced by its Hessian at the point $x_t$,$^1$ and $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$ is a
“prediction” of a time $t+1$ subgradient $g_{t+1}$ based on information up to time $t$, then one
can obtain regret bounds of the following form:

\[ \mathrm{Reg}_T(\mathcal{A}, x) \le \frac{1}{\eta} \mathcal{R}(x) + 2\eta \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{\nabla^2 \mathcal{R}(x_t),*}. \]

Here, $\|\cdot\|_{\nabla^2 \mathcal{R}(x_t),*}$ denotes the dual norm of $\|\cdot\|_{\nabla^2 \mathcal{R}(x_t)}$: for any $x$,
$\|x\|_{\nabla^2 \mathcal{R}(x_t),*} = \sup_{\|y\|_{\nabla^2 \mathcal{R}(x_t)} \le 1} x^\top y$. This guarantee can be very favorable in the optimistic case where
$\tilde g_t \approx g_t$ for all $t$. Nevertheless, it admits the drawback that much less control is available
over the induced norm since it is difficult to predict, for a given self-concordant function $\mathcal{R}$,
the behavior of its Hessian at the points $x_t$ selected by an algorithm. Moreover, there is
no guarantee of “near-optimality” with respect to an optimal a posteriori regularization as
there is with the adaptive algorithm.

$^1$ The norm induced by a symmetric positive definite (SPD) matrix $A$ is defined for any $x$ by
$\|x\|_A = \sqrt{x^\top A x}$.

Algorithm 1 AO-FTRL
1: Input: regularization function $r_0 \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x \in \mathcal{K}} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Compute $g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict gradient $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x)$.
8: end for

This paper presents a powerful general framework for designing online convex optimization
algorithms combining adaptive regularization and optimistic gradient prediction
which helps address several of the issues just pointed out. Our framework builds upon and
unifies recent techniques in adaptive regularization, optimistic gradient predictions, and
problem-dependent randomization. In Section 2, we describe a series of adaptive and optimistic
algorithms for which we prove strong regret guarantees, including a new Adaptive
and Optimistic Follow-the-Regularized-Leader (AO-FTRL) algorithm (Section 2.1) and a
more general version of this algorithm with composite terms (Section 2.3). These new regret
guarantees hold at any time and under very minimal assumptions. We also show how
different relaxations recover both basic existing algorithms as well as more recent sophisticated
ones. In a specific application, we will also show how a certain choice of regularization
functions will produce an optimistic regret bound that is also nearly a posteriori optimal,
combining the two different desirable properties mentioned above. Lastly, in Section 3, we
further combine adaptivity and optimism with problem-dependent randomization to devise
algorithms benefitting from more favorable guarantees than recent state-of-the-art methods.

2 Adaptive and Optimistic Follow-the-Regularized-Leader 
algorithms 

2.1 AO-FTRL algorithm 

In view of the discussion in the previous section, we present an adaptive and optimistic 
version of the Follow-the-Regularized-Leader (FTRL) family of algorithms. In each round 
of standard FTRL, a point is chosen that is the minimizer of the average linearized loss 
incurred plus a regularization term. In our new version of FTRL, we will find a minimizer 
of not only the average loss incurred, but also a prediction of the next round’s loss. In 
addition, we will define a dynamic time-varying sequence of regularization functions that 
can be used to optimize against this new loss term. Algorithm 1 shows the pseudocode of 
our Adaptive and Optimistic Follow-the-Regularized-Leader (AO-FTRL) algorithm. 
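
The update rule described above can be sketched in a few lines for one simple special case. The sketch below assumes a fixed quadratic regularizer $r_0(x) = \frac{\sigma}{2}\|x\|_2^2$ with $r_t = 0$ for $t \ge 1$ (so it is optimistic but not adaptive) and takes $\mathcal{K}$ to be a Euclidean ball, for which the FTRL minimizer is a rescaling of the unconstrained one. The function name `ao_ftrl_ball` and the `grad_fn` and `predict` callables are hypothetical interfaces chosen for this illustration.

```python
import numpy as np

def ao_ftrl_ball(grad_fn, T, dim, sigma=1.0, radius=1.0, predict=None):
    """Optimistic FTRL sketch with r_0(x) = (sigma/2)||x||^2 and r_t = 0 for t >= 1.

    grad_fn(t, x) returns a (sub)gradient g_t at x; predict(history) returns the
    guess for the next gradient (default: the last gradient seen).
    Both callables are assumptions of this illustration.
    """
    if predict is None:
        predict = lambda history: history[-1]
    x = np.zeros(dim)          # x_1 = argmin of r_0 over the ball
    g_sum = np.zeros(dim)      # running sum g_{1:t}
    history, xs = [], []
    for t in range(1, T + 1):
        xs.append(x.copy())
        g = np.asarray(grad_fn(t, x), dtype=float)
        history.append(g)
        g_sum += g
        g_pred = np.asarray(predict(history), dtype=float)   # prediction of g_{t+1}
        # Minimizer of (g_{1:t} + g_pred) . x + (sigma/2)||x||^2 over the ball:
        # project the unconstrained minimizer onto the ball.
        z = -(g_sum + g_pred) / sigma
        norm = np.linalg.norm(z)
        x = z if norm <= radius else z * (radius / norm)
    return xs
```

Passing `predict=lambda history: np.zeros(dim)` switches the prediction off and gives plain FTRL with the same regularizer, which makes the effect of the extra predicted-gradient term easy to compare.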

The following result provides a regret guarantee for the algorithm when one uses proximal
regularizers, i.e. functions $r_t$ such that $\operatorname{argmin}_{x \in \mathcal{K}} r_t(x) = x_t$.

Theorem 1 (AO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions,
and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and
points $x_1, \ldots, x_{t-1}$. Assume further that the function $h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x)$
is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$ (i.e. $r_{0:t}$ is 1-strongly convex with
respect to $\|\cdot\|_{(t)}$). Then, the following regret bound holds for AO-FTRL (Algorithm 1):

\[ \mathrm{Reg}_T(\text{AO-FTRL}, x) = \sum_{t=1}^T f_t(x_t) - f_t(x) \le r_{0:T}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t),*}. \]

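Proof. Recall that $x_{t+1} = \operatorname{argmin}_x x \cdot (g_{1:t} + \tilde g_{t+1}) + r_{0:t}(x)$, and let
$y_t = \operatorname{argmin}_x x \cdot g_{1:t} + r_{0:t}(x)$. Then, by convexity, the following inequality holds:

\[ \sum_{t=1}^T f_t(x_t) - f_t(x) \le \sum_{t=1}^T g_t \cdot (x_t - x) = \sum_{t=1}^T (g_t - \tilde g_t) \cdot (x_t - y_t) + \tilde g_t \cdot (x_t - y_t) + g_t \cdot (y_t - x). \]

Now, we first prove by induction on $T$ that for all $x \in \mathcal{K}$ the following inequality holds:

\[ \sum_{t=1}^T \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t \le \sum_{t=1}^T g_t \cdot x + r_{0:T}(x). \]

For $T = 1$, since $\tilde g_1 = 0$ and $r_1 \ge 0$, the inequality follows by the definition of $y_1$. Now,
suppose the inequality holds at iteration $T$. Then, we can write

\[
\begin{aligned}
\sum_{t=1}^{T+1} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t
&= \Big[ \sum_{t=1}^{T} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t \Big] + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\le \Big[ \sum_{t=1}^{T} g_t \cdot x_{T+1} + r_{0:T}(x_{T+1}) \Big] + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\qquad \text{(by the induction hypothesis for } x = x_{T+1}\text{)} \\
&\le \big[ (g_{1:T} + \tilde g_{T+1}) \cdot x_{T+1} + r_{0:T+1}(x_{T+1}) \big] + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\qquad \text{(since } r_t \ge 0,\ \forall t\text{)} \\
&\le \big[ (g_{1:T} + \tilde g_{T+1}) \cdot y_{T+1} + r_{0:T+1}(y_{T+1}) \big] + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\qquad \text{(by definition of } x_{T+1}\text{, since } r_{T+1} \text{ is proximal)} \\
&\le g_{1:T+1} \cdot y + r_{0:T+1}(y), \text{ for any } y \\
&\qquad \text{(by definition of } y_{T+1}\text{)}.
\end{aligned}
\]

Thus, we have that $\sum_{t=1}^T f_t(x_t) - f_t(x) \le r_{0:T}(x) + \sum_{t=1}^T (g_t - \tilde g_t) \cdot (x_t - y_t)$ and it suffices
to bound $\sum_{t=1}^T (g_t - \tilde g_t)^\top (x_t - y_t)$. Notice that, by duality, one can immediately write
$(g_t - \tilde g_t)^\top (x_t - y_t) \le \|g_t - \tilde g_t\|_{(t),*} \|x_t - y_t\|_{(t)}$. To bound $\|x_t - y_t\|_{(t)}$ in terms of the gradient,
recall first that since $r_t$ is proximal and $x_t = \operatorname{argmin}_x h_{0:t-1}(x)$,

\[
\begin{aligned}
x_t &= \operatorname{argmin}_x\ h_{0:t-1}(x) + r_t(x), \\
y_t &= \operatorname{argmin}_x\ h_{0:t-1}(x) + r_t(x) + (g_t - \tilde g_t) \cdot x.
\end{aligned}
\]

The fact that $r_{0:t}(x)$ is 1-strongly convex with respect to the norm $\|\cdot\|_{(t)}$ implies that
$h_{0:t-1} + r_t$ is as well. In particular, it is 1-strongly convex at the points $x_t$ and $y_t$. But
this then implies that the conjugate function is 1-strongly smooth on the image of the
gradient, including at $\nabla(h_{0:t-1} + r_t)(x_t) = 0$ and $\nabla(h_{0:t-1} + r_t)(y_t) = -(g_t - \tilde g_t)$ (see
Lemma 1 in the appendix or (Rockafellar, 1970) for a general reference), which means that

\[ \big\| \nabla\big((h_{0:t-1} + r_t)^*\big)(-(g_t - \tilde g_t)) - \nabla\big((h_{0:t-1} + r_t)^*\big)(0) \big\|_{(t)} \le \|g_t - \tilde g_t\|_{(t),*}. \]

Since $\nabla\big((h_{0:t-1} + r_t)^*\big)(-(g_t - \tilde g_t)) = y_t$ and $\nabla\big((h_{0:t-1} + r_t)^*\big)(0) = x_t$, we have that
$\|x_t - y_t\|_{(t)} \le \|g_t - \tilde g_t\|_{(t),*}$. □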

The regret bound just presented can be vastly superior to the adaptive methods of
(Duchi et al., 2010), (McMahan and Streeter, 2010), and others. For instance, one common
choice of gradient prediction is $\tilde g_{t+1} = g_t$, so that for slowly varying gradients (e.g. nearly
“flat” functions), $g_t - \tilde g_t \approx 0$, but $\|g_t\|_{(t)} = \|g\|_{(t)}$. Moreover, for reasonable gradient
predictions, $\|\tilde g_t\|_{(t)} \approx \|g_t\|_{(t)}$ generally, so that in the worst case, Algorithm 1's regret will
be at most a factor of two more than standard methods. At the same time, the use of non
self-concordant regularization allows one to more explicitly control the induced norm in the
regret bound as well as provide more efficient updates than those of (Rakhlin and Sridharan,
2013). Section 2.2.1 presents an upgraded version of online gradient descent as an example,
where our choice of regularization allows our algorithm to accelerate as the gradient
predictions become more accurate.
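
The gap between the two kinds of data-dependent terms can be seen numerically. The sketch below compares $\sum_t \|g_t\|_2^2$, the quantity driving purely adaptive bounds, with $\sum_t \|g_t - \tilde g_t\|_2^2$ under the prediction $\tilde g_{t+1} = g_t$ and $\tilde g_1 = 0$. The Euclidean norm and the function name `adaptive_vs_optimistic` are assumptions made for this illustration; the theorem itself is stated for time-varying norms.

```python
import numpy as np

def adaptive_vs_optimistic(gs):
    """Return (sum_t ||g_t||^2, sum_t ||g_t - g~_t||^2) with g~_1 = 0 and g~_{t+1} = g_t."""
    gs = np.asarray(gs, dtype=float)                      # shape (T, n), row t is g_t
    preds = np.vstack([np.zeros_like(gs[0]), gs[:-1]])    # shifted copy: predictions g~_t
    adaptive = float(np.sum(gs ** 2))
    optimistic = float(np.sum((gs - preds) ** 2))
    return adaptive, optimistic

# A constant gradient sequence: the adaptive term grows linearly in T,
# while the optimistic term only pays for the first (unpredicted) round.
adaptive, optimistic = adaptive_vs_optimistic(np.full((100, 3), 0.5))
```

For this constant sequence the first quantity is $100 \cdot 3 \cdot 0.25 = 75$ while the second is $3 \cdot 0.25 = 0.75$.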

Note that the assumption of strong convexity of $h_{0:t}$ is not a significant constraint, as any
quadratic or entropic regularizer from the standard mirror descent algorithms will satisfy
this property.

Moreover, if the loss functions $\{f_t\}_{t=1}^\infty$ themselves are 1-strongly convex, then one can set
$r_{0:t} = 0$ and still get a favorable induced norm $\|\cdot\|^2_{(t),*} = \frac{1}{t}\|\cdot\|^2_2$. If the gradients and gradient
predictions are uniformly bounded, this recovers the worst-case $\log(T)$ regret bounds. At
the same time, Algorithm 1 would also still retain the potentially highly favorable data-dependent
and optimistic regret bound.

Steinhardt and Liang (2014) also studied adaptivity and
optimism in online learning in the context of mirror descent-type algorithms. If, in the proof
above, we assume their condition:

r^.t+ii-ggi-t) < rl,t{-v{gi.t - gt)) - gxj{gt - gt), 

then we obtain the following regret bound: ft{xt) — J2t=i ft{x) < ’~Ao)+’~o:t+i(3^) ^ 

Our algorithm, however, is generally easier to use since it holds for any sequence of regularization
functions and does not require checking for that condition.

In some cases, it may be preferable to use non-proximal adaptive regularization. Since
non-adaptive non-proximal FTRL corresponds to dual averaging, this scenario arises, for instance,
when one wishes to use regularizers such as the negative entropy to derive algorithms
from the Exponentiated Gradient (EG) family (see (Shalev-Shwartz, 2012) for background).
We thus present the following theorem for this family of algorithms: Adaptive Optimistic
Follow-the-Regularized-Leader - General version (AO-FTRL-Gen).

Theorem 2 (AO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let
$\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points
$x_1, \ldots, x_{t-1}$. Assume further that the function $h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x)$ is
1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$ (i.e. $r_{0:t}$ is 1-strongly convex with respect to $\|\cdot\|_{(t)}$).
Then, the following regret bound holds for AO-FTRL (Algorithm 1):

\[ \sum_{t=1}^T f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t-1),*}. \]

Due to space constraints, the proof of this theorem, as well as those of all further results
in the remainder of Section 2, are presented in Appendix 5.

As in the case of proximal regularization, Algorithm 1 applied to general regularizers still 
admits the same benefits over the standard adaptive algorithms. In particular, the above 
algorithm is an easy upgrade over any dual averaging algorithm. Section 2.2.2 illustrates 
one such example for the Exponentiated Gradient algorithm. 

Corollary 1. With the following suitable choices of the parameters in Theorem 3, the
following regret bounds can be recovered:

1. Adaptive FTRL-Prox of (McMahan, 2014) (up to a constant factor of 2): $\tilde g = 0$.

2. Primal-Dual AdaGrad of (Duchi et al., 2010): $r_{0:t} = \psi_t$, $\tilde g = 0$.

3. Optimistic FTRL of (Rakhlin and Sridharan, 2013): $r_0 = \frac{1}{\eta}\mathcal{R}$ where $\eta > 0$ and $\mathcal{R}$ a
self-concordant function, $r_t = \psi_t = 0$, $\forall t \ge 1$.


2.2 Applications 

2.2.1 Adaptive and Optimistic Gradient Descent 

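Corollary 2 (AO-GD). Let $\mathcal{K} \subset \times_{i=1}^n [-R_i, R_i]$ be an n-dimensional rectangle, and denote
$\Delta_{s,i} = \sqrt{\sum_{a=1}^s (g_{a,i} - \tilde g_{a,i})^2}$. Set

\[ r_{0:t} = \sum_{i=1}^n \sum_{s=1}^t \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2. \]

Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret
bound holds:

\[ \mathrm{Reg}_T(\text{AO-GD}, x) \le 4 \sum_{i=1}^n R_i \sqrt{\sum_{t=1}^T (g_{t,i} - g_{t-1,i})^2}. \]

Moreover, this regret bound is nearly equal to the optimal a posteriori regret bound.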
= max Ri 
Notice that the regularization function is minimized when the gradient predictions become
more accurate. Thus, if we interpret our regularization as an implicit learning rate,
our algorithm uses a larger learning rate and accelerates as our gradient predictions become
more accurate. This is in stark contrast to other adaptive regularization methods, such as
AdaGrad, where learning rates are inversely proportional to simply the norm of the gradient.

Moreover, since the regularization function decomposes over the coordinates, this acceleration
can occur on a per-coordinate basis. If our gradient predictions are more accurate
in some coordinates than others, then our algorithm will be able to adapt accordingly. Under
the simple martingale prediction scheme, this means that our algorithm will be able
to adapt well when only certain coordinates of the gradient are slowly-varying, even if the
entire gradient is not.

In terms of computation, the AO-GD update can be executed in time linear in the
dimension (the same as for standard gradient descent). Moreover, since the gradient prediction
is simply the last gradient received, the algorithm also does not require much more
storage than the standard gradient descent algorithm. However, as we mentioned in the
general case, the regret bound here can be significantly more favorable than the standard
$O(\sqrt{T})$ bound of online gradient descent, or even its adaptive variants.
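
The linear-time, per-coordinate nature of the update can be illustrated with a short sketch. It assumes linear losses (so the gradient sequence can be passed in up front), a box $\times_i [-R_i, R_i]$, the prediction $\tilde g_{t+1} = g_t$ with $\tilde g_1 = 0$, and the per-coordinate quadratic regularizer $r_{0:t}(x) = \sum_i \sum_{s \le t} \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i}(x_i - x_{s,i})^2$ with $\Delta_{s,i} = \sqrt{\sum_{a \le s} (g_{a,i} - \tilde g_{a,i})^2}$. The function name `ao_gd`, the choice $x_1 = 0$, and the tie-breaking rule when $\Delta_{t,i} = 0$ are assumptions of this illustration rather than details confirmed by the paper.

```python
import numpy as np

def ao_gd(grads, R):
    """Sketch of an AO-GD run on a fixed sequence of linear-loss gradients.

    grads: array of shape (T, n), row t holds g_t.
    R:     array of shape (n,), half-widths of the box [-R_i, R_i].
    Returns the played points x_1, ..., x_T as an array of shape (T, n).
    """
    grads = np.asarray(grads, dtype=float)
    R = np.asarray(R, dtype=float)
    T, n = grads.shape
    x = np.zeros(n)             # x_1: r_0 = 0 here, so we pick the center of the box
    g_sum = np.zeros(n)         # g_{1:t}
    g_pred = np.zeros(n)        # g~_1 = 0
    sq_err = np.zeros(n)        # sum_a (g_{a,i} - g~_{a,i})^2
    delta_prev = np.zeros(n)    # Delta_{t-1,i}
    center_sum = np.zeros(n)    # sum_s (Delta_{s,i} - Delta_{s-1,i}) * x_{s,i}
    xs = np.empty((T, n))
    for t in range(T):
        xs[t] = x
        g = grads[t]
        g_sum += g
        sq_err += (g - g_pred) ** 2
        delta = np.sqrt(sq_err)                   # Delta_{t,i}
        center_sum += (delta - delta_prev) * x    # r_t is centered at x_t (proximal)
        delta_prev = delta
        g_pred = g                                # prediction g~_{t+1} = g_t
        G = g_sum + g_pred
        # Per coordinate: minimize G_i x + sum_s (dDelta_s / (2 R_i)) (x - x_s)^2.
        # Setting the derivative to zero gives x = (center_sum - R G) / Delta_t;
        # clipping to the box is exact because the objective is separable and convex.
        safe = np.where(delta > 0, delta, 1.0)
        interior = np.clip((center_sum - R * G) / safe, -R, R)
        corner = -R * np.sign(G)                  # linear objective when Delta_t = 0
        x = np.where(delta > 0, interior, corner)
    return xs
```

Each round costs $O(n)$ time and the state consists of a handful of length-$n$ vectors, in line with the computational remarks above. On a nearly constant gradient sequence the iterates jump to the optimal corner of the box after a single round and stay there.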

2.2.2 Adaptive and Optimistic Exponentiated Gradient 

Corollary 3 (AO-EG). Let $\mathcal{K} = \Delta_n$ be the n-dimensional simplex and $\varphi : x \mapsto \sum_{i=1}^n x_i \log(x_i)$
the negative entropy. Assume that $\|g_t\|_\infty \le C$ for all $t$ and set


ro-.t = 


.C" + Z]s=i Il5s - 5s 


log(n) 


log(n)). 


Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret bound
holds:


Reg-jn(AO-EG, x) 



21og(n) 


T-1 


C + ^ \\gt 


t=i 



The above algorithm admits the same advantages over predecessors as the AO-GD algorithm.
Moreover, observe that this bound holds at any time and does not require the
tuning of any learning rate. Steinhardt and Liang (2014) also introduce
a similar algorithm for EG, one that could actually be more favorable if the optimal a
posteriori learning rate is known in advance.


2.3 CAO-FTRL algorithm (Composite Adaptive Optimistic Follow-the-Regularized-Leader)

In some cases, we may wish to impose some regularization on our original optimization
problem to ensure properties such as generalization (e.g. $\ell_2$-norm in SVM) or sparsity (e.g.
$\ell_1$-norm in Lasso). This “composite term” can be treated directly by modifying the regularization
in our FTRL update. However, if we wish for the regularization penalty to appear
in the regret expression but do not wish to linearize it (which could mitigate effects such as
sparsity), then some extra care needs to be taken.

Algorithm 2 CAO-FTRL
1: Input: regularization function $r_0 \ge 0$, composite functions $\{\psi_t\}_{t=1}^\infty$ where $\psi_t \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x \in \mathcal{K}} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Compute $g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict the next gradient $\tilde g_{t+1} = \tilde g_{t+1}(g_1, \ldots, g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$.
8: end for


We modify Algorithm 1 to obtain Algorithm 2, and we provide accompanying regret 
bounds for both proximal and general regularization functions. In each theorem, we give a 
pair of regret bounds, depending on whether the learner considers the composite term as an 
additional part of the loss. 

All proofs are provided in Appendix 5. 
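
To see why keeping the composite term un-linearized preserves sparsity, consider one simple instance of the composite update. The sketch below assumes an unconstrained domain, a quadratic regularizer $r_{0:t}(x) = \frac{\sigma}{2}\|x\|_2^2$, and an $\ell_1$ composite term with accumulated weight `lam_total`, so that $\psi_{1:t+1}(x) = \texttt{lam\_total} \cdot \|x\|_1$. Under these assumptions the argmin has a closed form given by soft-thresholding. The function names and parameters are hypothetical and chosen for this illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    """Shrink each entry of z toward zero by tau, setting small entries exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cao_ftrl_l1_step(g_sum, g_pred, sigma, lam_total):
    """One composite update: argmin_x (g_sum + g_pred).x + (sigma/2)||x||^2 + lam_total*||x||_1.

    g_sum plays the role of g_{1:t}, g_pred the role of the predicted next gradient.
    """
    G = np.asarray(g_sum, dtype=float) + np.asarray(g_pred, dtype=float)
    return soft_threshold(-G, lam_total) / sigma

# Coordinates whose accumulated (plus predicted) gradient is below lam_total stay at zero.
x_next = cao_ftrl_l1_step(g_sum=[1.0, -0.2, 0.05], g_pred=[0.5, -0.1, 0.0],
                          sigma=2.0, lam_total=0.2)
```

Had the $\ell_1$ term been linearized through its subgradient instead, the update would generally produce small but nonzero coordinates, which is the loss of sparsity the text warns about.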

Theorem 3 (CAO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions,
such that $\operatorname{argmin}_{x \in \mathcal{K}} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $g_t$ given the history
of functions $f_1, \ldots, f_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^\infty$ be a sequence of non-negative
convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function $h_{0:t} : x \mapsto
g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$.
Then the following regret bounds hold for CAO-FTRL (Algorithm 2):

\[ \sum_{t=1}^T f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t-1),*}, \]

\[ \sum_{t=1}^T [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t),*}. \]

Notice that if we don’t consider the composite term as part of our loss, then our regret 
bound resembles the form of AO-FTRL-Gen. This is in spite of the fact that we are using 
proximal adaptive regularization. On the other hand, if the composite term is part of our 
loss, then our regret bound resembles the one using AO-FTRL-Prox. 

Theorem 4 (CAO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and
let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \ldots, f_{t-1}$ and points
$x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^\infty$ be a sequence of non-negative convex functions such that $\psi_1(x_1) = 0$.
Assume further that the function $h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is
1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the following regret bounds hold
for CAO-FTRL (Algorithm 2):

\[ \sum_{t=1}^T f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t-1),*}, \]

\[ \sum_{t=1}^T [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T-1}(x) + \sum_{t=1}^T \|g_t - \tilde g_t\|^2_{(t),*}. \]

Algorithm 3 CAOS-FTRL
1: Input: regularization function $r_0 \ge 0$, composite functions $\{\psi_t\}_{t=1}^\infty$ where $\psi_t \ge 0$.
2: Initialize: $\tilde g_1 = 0$, $x_1 = \operatorname{argmin}_{x \in \mathcal{K}} r_0(x)$.
3: for $t = 1, \ldots, T$ do
4: Query $\hat g_t$ where $\mathbb{E}[\hat g_t \mid x_1, \ldots, x_t, \hat g_1, \ldots, \hat g_{t-1}] = g_t \in \partial f_t(x_t)$.
5: Construct regularizer $r_t \ge 0$.
6: Predict next gradient $\tilde g_{t+1} = \tilde g_{t+1}(\hat g_1, \ldots, \hat g_t, x_1, \ldots, x_t)$.
7: Update $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$.
8: end for


3 Adaptive Optimistic and Stochastic Follow-the-Regularized- 
Leader algorithms 


3.1 CAOS-FTRL algorithm (Composite Adaptive Optimistic Follow-the-Regularized-Leader)

We now generalize the scenario to that of stochastic online convex optimization, where,
instead of exact subgradient elements $g_t$, we receive only estimates. Specifically, we assume
access to a sequence of vectors of the form $\hat g_t$, where $\mathbb{E}[\hat g_t \mid \hat g_1, \ldots, \hat g_{t-1}, x_1, \ldots, x_t] = g_t$.
This extension is in fact well-documented in the literature (see (Shalev-Shwartz, 2012) for
a reference), and the extension of our adaptive and optimistic variant follows accordingly.
For completeness, we provide the proofs of the following theorems in Appendix 8.

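Theorem 5 (CAOS-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions,
such that $\operatorname{argmin}_{x \in \mathcal{K}} r_t(x) = x_t$, and let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the
history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points $x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^\infty$ be a sequence
of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function
$h_{0:t}(x) = \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is 1-strongly convex with respect to some
norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the following
regret bounds:

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) - f_t(x) \Big] \le \mathbb{E}\Big[ \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^T \|\hat g_t - \tilde g_t\|^2_{(t-1),*} \Big], \]

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \Big] \le \mathbb{E}\Big[ r_{0:T}(x) + \sum_{t=1}^T \|\hat g_t - \tilde g_t\|^2_{(t),*} \Big]. \]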

Theorem 6 (CAOS-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let
$\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \ldots, \hat g_{t-1}$ and points
$x_1, \ldots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^\infty$ be a sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$.
Assume furthermore that the function $h_{0:t}(x) = \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$ is
1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update $x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$
of Algorithm 3 yields the regret bounds:

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) - f_t(x) \Big] \le \mathbb{E}\Big[ \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^T \|\hat g_t - \tilde g_t\|^2_{(t-1),*} \Big], \]

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \Big] \le \mathbb{E}\Big[ r_{0:T-1}(x) + \sum_{t=1}^T \|\hat g_t - \tilde g_t\|^2_{(t-1),*} \Big]. \]

The algorithm above enjoys the same advantages over its non-adaptive or non-optimistic
predecessors. Moreover, the choices of the adaptive regularizers $\{r_t\}_{t=1}^\infty$ and gradient predictions
$\{\tilde g_t\}_{t=1}^\infty$ now also depend on the randomness of the gradients received. While masked in
the above regret bounds, this interplay will come up explicitly in the following two examples,
where we, as the learner, impose randomness into the problem.


3.2 Applications 

3.2.1 Randomized Coordinate Descent with Adaptive Probabilities 

Randomized coordinate descent is a method that is often used for very large-scale problems 
where it is impossible to compute and/or store entire gradients at each step. It is also 
effective for directly enforcing sparsity in a solution since the support of the final point Xt 
cannot be larger than the number of updates introduced. 

The standard randomized coordinate descent update is to choose a coordinate uniformly
at random (see e.g. (Shalev-Shwartz and Tewari, 2011)). Nesterov (2012)
analyzed random coordinate descent in the context of loss functions with higher regularity
and showed that one can attain better bounds by using non-uniform probabilities.

In the randomized coordinate descent framework, at each round $t$ we specify a distribution
$p_t$ over the $n$ coordinates and pick a coordinate $i_t \in \{1, \ldots, n\}$ randomly according
to this distribution. From here, we then construct an unbiased estimate of an
element of the subgradient: $\hat g_t = \frac{(g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}}$. This technique is common in the online
learning literature, particularly in the context of the multi-armed bandit problem (see e.g.
(Cesa-Bianchi and Lugosi, 2006) for more information).
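
The importance-weighting construction just described can be checked in a few lines. The sketch below builds the single-coordinate estimate $(g \cdot e_i) e_i / p_i$ and verifies unbiasedness by enumerating all coordinates rather than by sampling. The function names are hypothetical, and the check assumes every $p_i > 0$.

```python
import numpy as np

def coordinate_estimate(g, p, i):
    """Importance-weighted estimate (g . e_i) e_i / p_i of the full gradient g."""
    est = np.zeros(len(g), dtype=float)
    est[i] = g[i] / p[i]
    return est

def expected_estimate(g, p):
    """Exact expectation over i ~ p, computed by enumerating the coordinates."""
    return sum(p[i] * coordinate_estimate(g, p, i) for i in range(len(g)))

g = np.array([1.0, -2.0, 0.5])
p = np.array([0.2, 0.5, 0.3])        # any distribution with strictly positive entries
mean_estimate = expected_estimate(g, p)   # coincides with g
```

Only one coordinate of the gradient is ever read per round, which is what makes the method attractive when full gradients are too expensive to compute or store. The price is variance: coordinates sampled with small probability produce large estimates.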

The following theorem can be derived by applying Theorem 5 to the gradient estimates 
just constructed. We provide a proof in Appendix 9. 

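Theorem 7 (CAO-RCD). Assume $\mathcal{K} \subset \times_{i=1}^n [-R_i, R_i]$. Let $i_t$ be a random variable sampled
according to the distribution $p_t$, and let

\[ \hat g_t = \frac{(g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}}, \qquad \hat{\tilde g}_t = \frac{(\tilde g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}} \]

be the estimated gradient and estimated gradient prediction. Denote $\Delta_{s,i} = \sqrt{\sum_{a=1}^s (\hat g_{a,i} - \hat{\tilde g}_{a,i})^2}$,
and let

\[ r_{0:t} = \sum_{i=1}^n \sum_{s=1}^t \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2 \]

be the adaptive regularization. Then, the regret of the algorithm can be bounded by:

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) + \alpha_t \psi(x_t) - f_t(x) - \alpha_t \psi(x) \Big] \le 4 \sum_{i=1}^n R_i \sqrt{\sum_{t=1}^T \mathbb{E}\Big[ \frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}} \Big]}. \]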

In general, we do not have access to an element of the subgradient $g_t$ before we sample
according to $p_t$. However, if we assume that we have some per-coordinate upper bound on an
element of the subgradient uniform in time, i.e. $|g_{t,j}| \le L_j$ $\forall t \in \{1, \ldots, T\}$, $j \in \{1, \ldots, n\}$,
then we can use the fact that $|g_{t,j} - \tilde g_{t,j}| \le \max\{L_j - \tilde g_{t,j}, \tilde g_{t,j}\}$ to motivate setting $\tilde g_{t,j} := \frac{L_j}{2}$
and $p_{t,j} = \frac{(R_j L_j)^{2/3}}{\sum_{k=1}^n (R_k L_k)^{2/3}}$ (by computing the optimal distribution). This yields the following
regret bound.


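Corollary 4 (CAO-RCD-Lipschitz). Assume that at any time $t$ the following per-coordinate
Lipschitz bounds hold on the loss function: $|g_{t,i}| \le L_i$, $\forall i \in \{1, \ldots, n\}$. Set
$p_{t,i} = \frac{(R_i L_i)^{2/3}}{\sum_{j=1}^n (R_j L_j)^{2/3}}$ as the probability distribution at time $t$, and set $\tilde g_{t,i} = \frac{L_i}{2}$. Then, the regret
of the algorithm can be bounded as follows:

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) + \alpha_t \psi(x_t) - f_t(x) - \alpha_t \psi(x) \Big] \le 2\sqrt{T} \Big( \sum_{i=1}^n (R_i L_i)^{2/3} \Big)^{3/2}. \]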


An application of Hölder's inequality will reveal that this bound is strictly smaller than
the $2RL\sqrt{nT}$ bound one would obtain from randomized coordinate descent using the uniform
distribution. Moreover, the algorithm above still retains the intermediate data-dependent
bound of Theorem 7.
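
The benefit of the non-uniform distribution can be illustrated numerically. The sketch below assumes that the worst-case bound depends on the sampling distribution through the factor $\sum_i R_i L_i / \sqrt{p_i}$ (an assumption of this illustration, consistent with a bound driven by $(g_{t,i} - \tilde g_{t,i})^2 / p_{t,i}$), and compares the choice $p_i \propto (R_i L_i)^{2/3}$ with uniform sampling. The function names and the example constants are hypothetical.

```python
import numpy as np

def rcd_probabilities(R, L):
    """Sampling distribution with p_i proportional to (R_i L_i)^(2/3)."""
    w = (np.asarray(R, dtype=float) * np.asarray(L, dtype=float)) ** (2.0 / 3.0)
    return w / w.sum()

def sampling_factor(R, L, p):
    """sum_i R_i L_i / sqrt(p_i): assumed dependence of the worst-case bound on p."""
    c = np.asarray(R, dtype=float) * np.asarray(L, dtype=float)
    return float(np.sum(c / np.sqrt(p)))

# Hypothetical problem with very uneven per-coordinate Lipschitz constants.
R = np.array([1.0, 1.0, 1.0, 1.0])
L = np.array([8.0, 1.0, 0.5, 0.1])
p_adaptive = rcd_probabilities(R, L)
p_uniform = np.full(4, 0.25)
adaptive = sampling_factor(R, L, p_adaptive)   # equals (sum_i (R_i L_i)^(2/3))^(3/2)
uniform = sampling_factor(R, L, p_uniform)     # equals sqrt(n) * sum_i R_i L_i
```

The more uneven the products $R_i L_i$, the larger the gap between the two factors. When all products are equal, the adaptive distribution reduces to the uniform one and the two factors coincide.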

Notice the similarity between the sampling distribution generated here with the one
suggested by (Nesterov, 2012). However, Nesterov assumed higher regularity in his algorithm
(i.e. $f_t \in C^{1,1}$) and generated his probabilities from there. In our setting, we only need $f_t \in C^{0,1}$.
It should be noted that (Afkanpour et al., 2013) also proposed an importance-sampling
based approach to random coordinate descent for the specific setting of multiple kernel
learning. In their setting, they propose updating the sampling distribution at each point
in time instead of using uniform-in-time Lipschitz constants, which comes with a natural
computational tradeoff. Moreover, the introduction of adaptive per-coordinate learning
rates in our algorithm allows for tighter regret bounds in terms of the Lipschitz constants.

We can also derive the analogous mini-batch update: 


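Corollary 5 (CAO-RCD-Lipschitz-Mini-Batch). Assume $\mathcal{K} \subset \times_{i=1}^n [-R_i, R_i]$. Let $\bigcup_{j=1}^k \{\Pi_j\} =
\{1, \ldots, n\}$ be a partition of the coordinates, and let $e_{\Pi_j} = \sum_{i \in \Pi_j} e_i$. Assume we had the
following Lipschitz condition on the partition: $\|g_t \cdot e_{\Pi_j}\| \le L_j$ $\forall j \in \{1, \ldots, k\}$.

Define $\tilde R_i = \sum_{j \in \Pi_i} R_j$. Set $p_{t,i} = \frac{(\tilde R_i L_i)^{2/3}}{\sum_{j=1}^k (\tilde R_j L_j)^{2/3}}$ as the probability distribution at time $t$,
and set $\tilde g_{t,i} = \frac{L_i}{2}$.

Then the regret of the resulting algorithm is bounded by:

\[ \mathbb{E}\Big[ \sum_{t=1}^T f_t(x_t) + \alpha_t \psi(x_t) - f_t(x) - \alpha_t \psi(x) \Big] \le 2\sqrt{T} \Big( \sum_{i=1}^k (\tilde R_i L_i)^{2/3} \Big)^{3/2}. \]

While the expression is similar to the non-mini-batch version, the $L_i$ and $\tilde R_i$ terms now
have different meaning. Specifically, $L_i$ is a bound on the 2-norm of the components of the
gradient in each batch, and $\tilde R_i$ is the 1-norm of the corresponding sides of the hypercube.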
Algorithm 4 CAOS-Reg-ERM-Epoch
1: Input: scaling constant $\alpha > 0$, composite term $\psi$, $r_0 = 0$.

2: Initialize: initial point $x_1 \in \mathcal{K}$, distribution $p_1$.

3: Sample $j_1$ according to $p_1$, and set $t = 1$.

4: for $s = 1, \ldots, k$ do

5: Compute gl = V/j (xi) Vj G {1,... ,m}. 

6: for a = 1,..., T /k: do 

7: If T mod fc = 0, compute = V Vj. 

Jt 

8 : Set pt = and construct r* > 0. 

PtJt 

9: Sample jt+i ~ pt+i and set gt+i = --f-. 

10: Update $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + (t+1)\alpha\psi(x)$ and $t = t + 1$.

11: end for 

12: end for 


3.2.2 Stochastic Regularized Empirical Risk Minimization 

Many learning algorithms can be viewed as instances of regularized empirical risk minimization
(e.g. SVM, Logistic Regression, Lasso), where the goal is to minimize an objective
function of the following form:

\[ H(x) = \sum_{j=1}^m f_j(x) + \alpha\psi(x). \]

If we denote the first term by $F(x) = \sum_{j=1}^m f_j(x)$, then we can view this objective in our
CAOS-FTRL framework, where $f_t = F$ and $\psi_t = \alpha\psi$. In the same spirit as for non-uniform
random coordinate descent, we can estimate the gradient of $F$ at $x_t$ by sampling according
to some distribution $p_t$ and use importance weighting to generate an unbiased estimate: If
$g_t \in \partial F(x_t)$ and $g_t^j \in \partial f_j(x_t)$, then

\[ \hat g_t = \frac{g_t^{j_t}}{p_{t,j_t}}. \]

This motivates the design of an algorithm similar to the one derived for randomized 
coordinate descent. Here we elect to use as gradient prediction the last gradient of the 
current function being sampled fj. However, we may run into the problem of never seeing 
a function before. A logical modification would be to separate optimization into epochs 
and do a full batch update over all functions fj at the start of each epoch. This is similar 
to the technique used in the Stochastic Variance Reduced Gradient (SVRG) algorithm of 
Johnson and Zhang (2013). However, we do not assume extra function regularity as they do 
in their paper, so the bounds are not comparable. The algorithm is presented in Algorithm 4 
and comes with the following guarantee: 

Corollary 6. Assume $\mathcal{K} \subset \times_{i=1}^n [-R_i, R_i]$. Denote $\Delta_{s,i} = \sqrt{\sum_{a=1}^s (\hat g_{a,i} - \tilde g_{a,i})^2}$, and let
$r_{0:t} = \sum_{i=1}^n \sum_{s=1}^t \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2$ be the adaptive regularization.
Then the regret of Algorithm 4 is bounded by:


E 


ft{xt) + atjj{xt) - ft{x) - atjj{x) 




i=l \ s=l t=(s- 


k (s-l)(T/k)+T/k m 

E E E 


aii -gii 


i)(T/fc)+i i=i 


Pt,j 


Moreover, */||V/j||oo < Lj \/j, then setting pt^j = 






YAf=i 


yields a worst-case bound of: 


We also include a mini-batch version of this algorithm in Appendix 10, which can be 
useful due to the variance reduction of the gradient prediction. 


4 Conclusion 

We presented a general framework for developing efficient adaptive and optimistic algorithms 
for online convex optimization. Building upon recent advances in adaptive regularization 
and predictable online learning, we improved upon each method. We demonstrated the 
power of this approach by deriving algorithms with better guarantees than those commonly 
used in practice. In addition, we also extended adaptive and optimistic online learning to 
the randomized setting. Here, we highlighted an additional source of problem-dependent 
adaptivity (that of prescribing the sampling distribution), and we showed how one can 
perform better than traditional naive uniform sampling. 
Appendix

5 Proofs for Section 2 


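Lemma 1 (Duality Between Smoothness and Convexity for Convex Functions). Let $\mathcal{K}$ be
a convex set and $f : \mathcal{K} \to \mathbb{R}$ be a convex function. Suppose $f$ is 1-strongly
convex at $x_0$. Then $f^*$, the Legendre transform of $f$, is 1-strongly smooth at $y_0 = \nabla f(x_0)$.

Proof. Notice first that for any pair of convex functions $f, g : \mathcal{K} \to \mathbb{R}$, the
fact that $f(x_0) \ge g(x_0)$ for some $x_0 \in \mathcal{K}$ implies that $f^*(y_0) \le g^*(y_0)$
for $y_0 = \nabla f(x_0)$.

Now, $f$ being 1-strongly convex at $x_0$ means that
$f(x) \ge h(x) = f(x_0) + g_0 \cdot (x - x_0) + \frac{1}{2}\|x - x_0\|_2^2$, where $g_0 = \nabla f(x_0) = y_0$.
Thus, it suffices to show that
$h^*(y) = f^*(y_0) + x_0 \cdot (y - y_0) + \frac{1}{2}\|y - y_0\|_2^2$, since $x_0 = \nabla(h^*)(y_0)$.

To see this, we can compute that

$$\begin{aligned}
h^*(y) &= \max_x \; y \cdot x - h(x) \\
&= y \cdot (y - y_0 + x_0) - h(x) \qquad \text{(max attained at } y_0 + (x - x_0) = \nabla h(x) = y\text{)} \\
&= y \cdot (y - y_0 + x_0) - \left[ f(x_0) + y_0 \cdot (x - x_0) + \tfrac{1}{2}\|x - x_0\|_2^2 \right] \\
&= \tfrac{1}{2}\|y - y_0\|_2^2 + y \cdot x_0 - f(x_0) \\
&= -f(x_0) + x_0 \cdot y_0 + x_0 \cdot (y - y_0) + \tfrac{1}{2}\|y - y_0\|_2^2 \\
&= f^*(y_0) + x_0 \cdot (y - y_0) + \tfrac{1}{2}\|y - y_0\|_2^2 .
\end{aligned}$$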

□ 
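
As a sanity check on the computation in the proof of Lemma 1, the sketch below evaluates the conjugate of the quadratic lower bound $h$ in two ways: through the closed form $f^*(y_0) + x_0 \cdot (y - y_0) + \frac{1}{2}\|y - y_0\|_2^2$ with $f^*(y_0) = x_0 \cdot y_0 - f(x_0)$, and by plugging in the maximizer $x = x_0 + (y - y_0)$. The numerical values of `x0`, `y0`, `f0` and `y` are arbitrary illustrative choices, not quantities from the paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def h(x, x0, y0, f0):
    """Quadratic lower bound h(x) = f0 + y0.(x - x0) + 0.5 * ||x - x0||^2."""
    d = [a - b for a, b in zip(x, x0)]
    return f0 + dot(y0, d) + 0.5 * dot(d, d)


def h_conj_closed_form(y, x0, y0, f0):
    """Closed form from the proof: (x0.y0 - f0) + x0.(y - y0) + 0.5 * ||y - y0||^2."""
    d = [a - b for a, b in zip(y, y0)]
    return (dot(x0, y0) - f0) + dot(x0, d) + 0.5 * dot(d, d)


def h_conj_at_maximizer(y, x0, y0, f0):
    """Evaluate y.x - h(x) at the stationary point x = x0 + (y - y0)."""
    x_star = [a + b - c for a, b, c in zip(x0, y, y0)]
    return dot(y, x_star) - h(x_star, x0, y0, f0)


rng = random.Random(0)
x0 = [rng.uniform(-1, 1) for _ in range(3)]
y0 = [rng.uniform(-1, 1) for _ in range(3)]
y = [rng.uniform(-1, 1) for _ in range(3)]
f0 = 0.7

closed = h_conj_closed_form(y, x0, y0, f0)
at_max = h_conj_at_maximizer(y, x0, y0, f0)

# Values of y.x - h(x) at random points; none should exceed the conjugate.
trial_values = []
for _ in range(200):
    x = [rng.uniform(-5, 5) for _ in range(3)]
    trial_values.append(dot(y, x) - h(x, x0, y0, f0))
```

The two evaluations agree up to floating-point error, and no randomly drawn $x$ gives a larger value of $y \cdot x - h(x)$, which is what the definition of the conjugate requires.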

Theorem 2 (AO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and let
$\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions $f_1, \dots, f_{t-1}$
and points $x_1, \dots, x_{t-1}$. Assume further that the function
$h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x)$ is 1-strongly convex
with respect to some norm $\|\cdot\|_{(t)}$ (i.e. $r_{0:t}$ is 1-strongly convex with respect to $\|\cdot\|_{(t)}$).
Then, the following regret bound holds for AO-FTRL (Algorithm 1):

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 .$$

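Proof. Recall that $x_{t+1} = \operatorname{argmin}_x \; x \cdot (g_{1:t} + \tilde g_{t+1}) + r_{0:t}(x)$,
and let $y_t = \operatorname{argmin}_x \; x \cdot g_{1:t} + r_{0:t-1}(x)$. Then by convexity,

$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) - f_t(x) &\le \sum_{t=1}^{T} g_t \cdot (x_t - x) \\
&= \sum_{t=1}^{T} (g_t - \tilde g_t) \cdot (x_t - y_t) + \tilde g_t \cdot (x_t - y_t) + g_t \cdot (y_t - x).
\end{aligned}$$

Now, we first show via induction that $\forall x \in \mathcal{K}$, the following holds:

$$\sum_{t=1}^{T} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t \le \sum_{t=1}^{T} g_t \cdot x + r_{0:T-1}(x).$$

For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, and the definition of $y_t$ imply the result.

Now suppose the result is true for time $T$. Then

$$\begin{aligned}
&\sum_{t=1}^{T+1} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t \\
&\quad= \left[ \sum_{t=1}^{T} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\quad\le \left[ \sum_{t=1}^{T} g_t \cdot x_{T+1} + r_{0:T-1}(x_{T+1}) \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1}
 && \text{(by the induction hypothesis for } x = x_{T+1}\text{)} \\
&\quad\le \left[ (g_{1:T} + \tilde g_{T+1}) \cdot x_{T+1} + r_{0:T}(x_{T+1}) \right]
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1}
 && \text{(since } r_t \ge 0, \ \forall t\text{)} \\
&\quad\le \left[ (g_{1:T} + \tilde g_{T+1}) \cdot y_{T+1} + r_{0:T}(y_{T+1}) \right]
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1}
 && \text{(by definition of } x_{T+1}\text{)} \\
&\quad\le g_{1:T+1} \cdot y + r_{0:T}(y), \ \text{for any } y
 && \text{(by definition of } y_{T+1}\text{)}.
\end{aligned}$$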


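Thus, we have that
$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T} (g_t - \tilde g_t) \cdot (x_t - y_t)$,
and it suffices to bound $\sum_{t=1}^{T} (g_t - \tilde g_t)^{\top} (x_t - y_t)$. By duality, one
immediately gets
$(g_t - \tilde g_t) \cdot (x_t - y_t) \le \|g_t - \tilde g_t\|_{(t-1),*} \|x_t - y_t\|_{(t-1)}$.
To bound $\|x_t - y_t\|_{(t-1)}$ in terms of the gradient, recall first that

$$\begin{aligned}
x_t &= \operatorname{argmin}_x \; h_{0:t-1}(x), \\
y_t &= \operatorname{argmin}_x \; h_{0:t-1}(x) + (g_t - \tilde g_t) \cdot x.
\end{aligned}$$

The fact that $r_{0:t-1}(x)$ is 1-strongly convex with respect to the norm $\|\cdot\|_{(t-1)}$
implies that $h_{0:t-1}$ is as well. In particular, it is strongly convex at the points $x_t$ and
$y_t$. By Lemma 1, this implies that the conjugate function is smooth at
$\nabla(h_{0:t-1})(x_t)$ and $\nabla(h_{0:t-1})(y_t)$, so that

$$\left\| \nabla(h^*_{0:t-1})(-(g_t - \tilde g_t)) - \nabla(h^*_{0:t-1})(0) \right\|_{(t-1)} \le \|g_t - \tilde g_t\|_{(t-1),*} .$$

Since $\nabla(h^*_{0:t-1})(-(g_t - \tilde g_t)) = y_t$ and $\nabla(h^*_{0:t-1})(0) = x_t$, we have that
$\|x_t - y_t\|_{(t-1)} \le \|g_t - \tilde g_t\|_{(t-1),*}$. Substituting this into the duality bound
gives $(g_t - \tilde g_t) \cdot (x_t - y_t) \le \|g_t - \tilde g_t\|_{(t-1),*}^2$, which completes the proof.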

□ 
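
The stability step at the end of this proof, $\|x_t - y_t\| \le \|g_t - \tilde g_t\|_*$, can be checked by hand in the simplest case. The sketch below assumes an unconstrained domain and the quadratic regularizer $\frac{\sigma}{2}\|x\|_2^2$, which is 1-strongly convex with respect to the scaled norm $\sqrt{\sigma}\,\|\cdot\|_2$ (with dual norm $\|\cdot\|_2 / \sqrt{\sigma}$). These choices, and the names `ftrl_quadratic`, `c` and `d`, are illustrative assumptions rather than the paper's setting.

```python
import math
import random


def ftrl_quadratic(c, sigma):
    """argmin over R^n of c.x + (sigma / 2) * ||x||_2^2, i.e. x = -c / sigma."""
    return [-ci / sigma for ci in c]


def primal_norm(v, sigma):
    """Norm sqrt(sigma) * ||v||_2, for which (sigma/2)||x||^2 is 1-strongly convex."""
    return math.sqrt(sigma) * math.sqrt(sum(vi * vi for vi in v))


def dual_norm(v, sigma):
    """Dual of the norm above: ||v||_2 / sqrt(sigma)."""
    return math.sqrt(sum(vi * vi for vi in v)) / math.sqrt(sigma)


rng = random.Random(1)
sigma = 2.5
c = [rng.gauss(0, 1) for _ in range(4)]  # linear term of h (past gradients plus prediction)
d = [rng.gauss(0, 1) for _ in range(4)]  # prediction error g_t - prediction

x_t = ftrl_quadratic(c, sigma)                                    # minimizer of h
y_t = ftrl_quadratic([ci + di for ci, di in zip(c, d)], sigma)    # minimizer of h + d.x

gap = primal_norm([a - b for a, b in zip(x_t, y_t)], sigma)
err = dual_norm(d, sigma)
```

In this quadratic case the inequality holds with equality, since $x_t - y_t = d / \sigma$, so the stability bound used in the proof cannot be improved in general.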
Theorem 3 (CAO-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions,
such that $\operatorname{argmin}_{x \in \mathcal{K}} r_t(x) = x_t$, and let $\tilde g_t$ be the
learner's estimate of $g_t$ given the history of functions $f_1, \dots, f_{t-1}$ and points
$x_1, \dots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative convex
functions, such that $\psi_1(x_1) = 0$. Assume further that the function
$h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$
is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$.
Then the following regret bounds hold for CAO-FTRL (Algorithm 2):

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2$$

$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2 .$$

Proof. For the first regret bound, define the auxiliary regularization functions
$\tilde r_t(x) = r_t(x) + \psi_t(x)$, and apply Theorem 2 to get

$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) - f_t(x) &\le \tilde r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 \\
&= \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 .
\end{aligned}$$

Notice that while $r_t$ is proximal, $\tilde r_t$, in general, is not, and so we must apply the
theorem with general regularizers instead of the one with proximal regularizers.

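For the second regret bound, we can follow the prescription of Theorem 1 while keeping
track of the additional composite terms.

Recall that $x_{t+1} = \operatorname{argmin}_x \; x \cdot (g_{1:t} + \tilde g_{t+1}) + r_{0:t+1}(x) + \psi_{1:t+1}(x)$,
and let $y_t = \operatorname{argmin}_x \; x \cdot g_{1:t} + r_{0:t}(x) + \psi_{1:t}(x)$.

We can compute that:

$$\begin{aligned}
\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)]
&\le \sum_{t=1}^{T} g_t \cdot (x_t - x) + \psi_t(x_t) - \psi_t(x) \\
&= \sum_{t=1}^{T} (g_t - \tilde g_t) \cdot (x_t - y_t) + \tilde g_t \cdot (x_t - y_t) + g_t \cdot (y_t - x) + \psi_t(x_t) - \psi_t(x).
\end{aligned}$$

Similar to before, we show via induction that $\forall x \in \mathcal{K}$,

$$\sum_{t=1}^{T} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \le r_{0:T}(x) + \sum_{t=1}^{T} g_t \cdot x + \psi_t(x).$$

For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, $\psi_1(x_1) = 0$, and the definition of
$y_t$ imply the result.

Now suppose the result is true for time $T$. Then

$$\begin{aligned}
&\sum_{t=1}^{T+1} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \\
&\quad= \sum_{t=1}^{T} \left[ \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\quad\le \left[ \sum_{t=1}^{T} g_t \cdot x_{T+1} + \psi_t(x_{T+1}) + r_{0:T}(x_{T+1}) \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\qquad \text{(by the induction hypothesis for } x = x_{T+1}\text{)} \\
&\quad\le (g_{1:T} + \tilde g_{T+1}) \cdot x_{T+1} + r_{0:T+1}(x_{T+1}) + \psi_{1:T}(x_{T+1})
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\qquad \text{(since } r_t \ge 0, \ \forall t\text{)} \\
&\quad\le (g_{1:T} + \tilde g_{T+1}) \cdot y_{T+1} + r_{0:T+1}(y_{T+1}) + \psi_{1:T}(y_{T+1})
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(y_{T+1}) \\
&\qquad \text{(by definition of } x_{T+1}\text{)} \\
&\quad\le g_{1:T+1} \cdot y + r_{0:T+1}(y) + \psi_{1:T+1}(y), \ \text{for any } y
 \qquad \text{(by definition of } y_{T+1}\text{)}.
\end{aligned}$$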


Thus, we have that

$$\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - [f_t(x) + \psi_t(x)] \le r_{0:T}(x) + \sum_{t=1}^{T} (g_t - \tilde g_t)^{\top} (x_t - y_t),$$

and we can bound the sum in the same way as before, since the strong convexity properties
of $h_{0:t}$ are retained due to the convexity of $\psi_t$.

□ 

Theorem 4 (CAO-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and
let $\tilde g_t$ be the learner's estimate of $g_t$ given the history of functions
$f_1, \dots, f_{t-1}$ and points $x_1, \dots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a
sequence of non-negative convex functions such that $\psi_1(x_1) = 0$. Assume further that the
function $h_{0:t} : x \mapsto g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$
is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the following regret bounds
hold for CAO-FTRL (Algorithm 2):

$$\sum_{t=1}^{T} f_t(x_t) - f_t(x) \le \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2$$

$$\sum_{t=1}^{T} [f_t(x_t) + \psi_t(x_t)] - [f_t(x) + \psi_t(x)] \le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 .$$

Proof. For the first regret bound, define the auxiliary regularization functions
$\tilde r_t(x) = r_t(x) + \psi_t(x)$, and apply Theorem 2 to get

$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) - f_t(x) &\le \tilde r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 \\
&= \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 .
\end{aligned}$$

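For the second bound, we can proceed as in the original proof, but now keep track of
the additional composite terms.

Recall that $x_{t+1} = \operatorname{argmin}_x \; x \cdot (g_{1:t} + \tilde g_{t+1}) + r_{0:t}(x) + \psi_{1:t+1}(x)$,
and let $y_t = \operatorname{argmin}_x \; x \cdot g_{1:t} + r_{0:t-1}(x) + \psi_{1:t}(x)$. Then

$$\begin{aligned}
\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x)
&\le \sum_{t=1}^{T} g_t \cdot (x_t - x) + \psi_t(x_t) - \psi_t(x) \\
&= \sum_{t=1}^{T} (g_t - \tilde g_t) \cdot (x_t - y_t) + \tilde g_t \cdot (x_t - y_t) + g_t \cdot (y_t - x) + \psi_t(x_t) - \psi_t(x).
\end{aligned}$$

Now, we show via induction that $\forall x \in \mathcal{K}$,

$$\sum_{t=1}^{T} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \le \sum_{t=1}^{T} g_t \cdot x + \psi_t(x) + r_{0:T-1}(x).$$

For $T = 1$, the fact that $r_t \ge 0$, $\tilde g_1 = 0$, $\psi_1(x_1) = 0$, and the definition of
$y_t$ imply the result.

Now suppose the result is true for time $T$. Then

$$\begin{aligned}
&\sum_{t=1}^{T+1} \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \\
&\quad= \sum_{t=1}^{T} \left[ \tilde g_t \cdot (x_t - y_t) + g_t \cdot y_t + \psi_t(x_t) \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\quad\le \left[ \sum_{t=1}^{T} g_t \cdot x_{T+1} + \psi_t(x_{T+1}) + r_{0:T-1}(x_{T+1}) \right]
 + \tilde g_{T+1} \cdot (x_{T+1} - y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\qquad \text{(by the induction hypothesis for } x = x_{T+1}\text{)} \\
&\quad\le \left[ (g_{1:T} + \tilde g_{T+1}) \cdot x_{T+1} + r_{0:T}(x_{T+1}) + \psi_{1:T}(x_{T+1}) \right]
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} + \psi_{T+1}(x_{T+1}) \\
&\qquad \text{(since } r_t \ge 0, \ \forall t\text{)} \\
&\quad\le g_{1:T} \cdot y_{T+1} + \tilde g_{T+1} \cdot y_{T+1} + r_{0:T}(y_{T+1}) + \psi_{1:T+1}(y_{T+1})
 + \tilde g_{T+1} \cdot (-y_{T+1}) + g_{T+1} \cdot y_{T+1} \\
&\qquad \text{(by definition of } x_{T+1}\text{)} \\
&\quad\le g_{1:T+1} \cdot y + r_{0:T}(y) + \psi_{1:T+1}(y), \ \text{for any } y
 \qquad \text{(by definition of } y_{T+1}\text{)}.
\end{aligned}$$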


Thus, we have that
$\sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \le r_{0:T-1}(x) + \sum_{t=1}^{T} (g_t - \tilde g_t) \cdot (x_t - y_t)$,
and the remainder follows as in the non-composite setting since the strong convexity
properties are retained.


□ 
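
To illustrate why composite terms are handled inside the update rather than linearized, the sketch below works out one composite FTRL step in closed form. It assumes, purely for illustration, an unconstrained domain, a non-proximal quadratic regularizer $r_{0:t}(x) = \frac{\sigma}{2}\|x\|_2^2$, and an accumulated composite term $\psi_{1:t+1}(x) = \lambda \|x\|_1$; under these assumptions the minimizer of $c \cdot x + \frac{\sigma}{2}\|x\|_2^2 + \lambda\|x\|_1$ is a soft-thresholding of the linear term $c = g_{1:t} + \tilde g_{t+1}$. The function names are hypothetical.

```python
def composite_ftrl_step(c, sigma, lam):
    """Coordinate-wise minimizer of c.x + (sigma/2)*||x||_2^2 + lam*||x||_1 over R^n.

    Each coordinate is -sign(c_i) * max(|c_i| - lam, 0) / sigma (soft-thresholding).
    """
    out = []
    for ci in c:
        shrunk = max(abs(ci) - lam, 0.0)
        out.append(-shrunk / sigma if ci > 0 else shrunk / sigma)
    return out


def objective_1d(x, c, sigma, lam):
    """One-dimensional objective c*x + (sigma/2)*x^2 + lam*|x|."""
    return c * x + 0.5 * sigma * x * x + lam * abs(x)


c = [3.0, -0.5, -2.0]   # hypothetical accumulated linear term
sigma, lam = 2.0, 1.0
x_next = composite_ftrl_step(c, sigma, lam)

# Brute-force comparison against a fine grid on [-3, 3] for each coordinate.
grid = [k / 100.0 for k in range(-300, 301)]
grid_ok = all(
    objective_1d(xi, ci, sigma, lam) <= objective_1d(z, ci, sigma, lam) + 1e-12
    for xi, ci in zip(x_next, c)
    for z in grid
)
```

Coordinates whose accumulated linear term satisfies $|c_i| \le \lambda$ are set exactly to zero, which is the sparsity effect that would be lost if $\psi_t$ were replaced by its subgradient.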


6 Proofs for Section 2.2.1 


The following lemma is central to the derivation of regret bounds for many algorithms
employing adaptive regularization. Its proof, via induction, can be found in Auer et al.
(2002).

Lemma 2. Let $\{a_j\}_{j=1}^{\infty}$ be a sequence of non-negative numbers.
Then

$$\sum_{j=1}^{t} \frac{a_j}{\sqrt{\sum_{k=1}^{j} a_k}} \le 2 \sqrt{\sum_{j=1}^{t} a_j} .$$
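
The inequality of Lemma 2 is easy to probe numerically. The sketch below assumes the standard form of the statement, $\sum_{j=1}^{t} a_j / \sqrt{\sum_{k \le j} a_k} \le 2\sqrt{\sum_{j=1}^{t} a_j}$, with the convention that terms with a zero denominator contribute zero; the helper name `lemma2_sides` and the test sequences are illustrative.

```python
import math
import random


def lemma2_sides(a):
    """Return (lhs, rhs) of sum_j a_j / sqrt(a_1 + ... + a_j) <= 2 * sqrt(sum_j a_j)."""
    lhs = 0.0
    running = 0.0
    for aj in a:
        running += aj
        if running > 0.0:          # convention: 0 / 0 terms contribute nothing
            lhs += aj / math.sqrt(running)
    return lhs, 2.0 * math.sqrt(running)


# A constant sequence and a batch of random non-negative sequences.
lhs_const, rhs_const = lemma2_sides([1.0, 1.0, 1.0, 1.0])

random_checks = []
for seed in range(20):
    rng = random.Random(seed)
    seq = [rng.random() ** 3 for _ in range(100)]
    random_checks.append(lemma2_sides(seq))
```

For the constant sequence the right-hand side is exactly 4, while the left-hand side is $1 + 1/\sqrt{2} + 1/\sqrt{3} + 1/2$, so the inequality holds with some slack.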


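Corollary 2 (AO-GD). Let $\mathcal{K} \subset \times_{i=1}^{n} [-R_i, R_i]$ be an $n$-dimensional
rectangle, and denote $\Delta_{s,i} = \sqrt{\sum_{a=1}^{s} (g_{a,i} - \tilde g_{a,i})^2}$. Set

$$r_{0:t} = \sum_{i=1}^{n} \sum_{s=1}^{t} \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2 .$$

Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret
bound holds:

$$\mathrm{Reg}_T(x) \le 4 \sum_{i=1}^{n} R_i \sqrt{\sum_{t=1}^{T} (g_{t,i} - g_{t-1,i})^2} .$$

Moreover, this regret bound is nearly equal to the optimal a posteriori regret bound:

$$\max_i R_i \sum_{i=1}^{n} \sqrt{\sum_{t=1}^{T} (g_{t,i} - g_{t-1,i})^2}
= \max_i R_i \sqrt{ n \inf_{s \succeq 0, \, \langle s, 1 \rangle \le n} \sum_{t=1}^{T} \|g_t - g_{t-1}\|_{\mathrm{diag}(s)^{-1}}^2 } .$$

Proof. $r_{0:t}$ is 1-strongly convex with respect to the norm:

$$\|x\|_{(t)}^2 = \sum_{i=1}^{n} \frac{\sqrt{\sum_{a=1}^{t} (g_{a,i} - \tilde g_{a,i})^2}}{R_i} \, x_i^2 ,$$

which has corresponding dual norm:

$$\|x\|_{(t),*}^2 = \sum_{i=1}^{n} \frac{R_i}{\sqrt{\sum_{a=1}^{t} (g_{a,i} - \tilde g_{a,i})^2}} \, x_i^2 .$$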


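By the choice of this regularization, the prediction $\tilde g_t = g_{t-1}$, and Theorem 3, the
following holds:

$$\begin{aligned}
\mathrm{Reg}_T(\mathcal{A}, x)
&\le \sum_{i=1}^{n} \sum_{s=1}^{T} \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2
 + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t),*}^2 \\
&\le \sum_{i=1}^{n} 2R_i \sqrt{\sum_{t=1}^{T} (g_{t,i} - g_{t-1,i})^2}
 + \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{R_i (g_{t,i} - g_{t-1,i})^2}{\sqrt{\sum_{a=1}^{t} (g_{a,i} - g_{a-1,i})^2}} \\
&\le \sum_{i=1}^{n} 2R_i \sqrt{\sum_{t=1}^{T} (g_{t,i} - g_{t-1,i})^2}
 + \sum_{i=1}^{n} 2R_i \sqrt{\sum_{t=1}^{T} (g_{t,i} - g_{t-1,i})^2}
 \qquad \text{(by Lemma 2)}.
\end{aligned}$$

The second line uses $(x_i - x_{s,i})^2 \le 4R_i^2$ on the rectangle together with the telescoping
sum $\sum_{s=1}^{T} (\Delta_{s,i} - \Delta_{s-1,i}) = \Delta_{T,i}$.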

The last statement follows from the fact that

$$\inf_{s \succeq 0, \, \langle s, 1 \rangle \le n} \sum_{t=1}^{T} \sum_{i=1}^{n} \frac{g_{t,i}^2}{s_i}
= \frac{1}{n} \left( \sum_{i=1}^{n} \|g_{1:T,i}\|_2 \right)^2 ,$$

since the infimum on the left hand side is attained when $s_i \propto \|g_{1:T,i}\|_2$.


□ 
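
The update behind Corollary 2 can be written in closed form, and the following is a minimal sketch of it for linear losses $f_t(x) = g_t \cdot x$. It assumes the regularizer has the per-coordinate form $\sum_s \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i}(x_i - x_{s,i})^2$ with $\Delta_{t,i} = \sqrt{\sum_{a \le t} (g_{a,i} - \tilde g_{a,i})^2}$, the prediction $\tilde g_{t+1} = g_t$ with $\tilde g_1 = 0$, and $x_1 = 0$; because the objective is a separable quadratic, the constrained minimizer is obtained by clipping each coordinate to $[-R_i, R_i]$. The function names and the random test data are hypothetical.

```python
import math
import random


def ao_gd(gradients, R):
    """Run the adaptive optimistic update on linear losses over prod_i [-R_i, R_i].

    Returns the iterates x_1, ..., x_T (one list per round).
    """
    n = len(R)
    x = [0.0] * n          # x_1
    g_sum = [0.0] * n      # g_{1:t}
    err_sq = [0.0] * n     # Delta_{t,i}^2 = sum_a (g_{a,i} - prediction_{a,i})^2
    weight = [0.0] * n     # sum_s sigma_{s,i} = Delta_{t,i} / R_i
    center = [0.0] * n     # sum_s sigma_{s,i} * x_{s,i}
    pred = [0.0] * n       # prediction for the first gradient is 0
    iterates = []
    for g in gradients:
        iterates.append(list(x))
        for i in range(n):
            g_sum[i] += g[i]
            err_sq[i] += (g[i] - pred[i]) ** 2
            new_weight = math.sqrt(err_sq[i]) / R[i]
            sigma = new_weight - weight[i]     # (Delta_t - Delta_{t-1}) / R_i
            center[i] += sigma * x[i]          # r_t is centered at x_t (proximal)
            weight[i] = new_weight
        pred = list(g)                         # martingale-type prediction
        for i in range(n):
            c = g_sum[i] + pred[i]
            if weight[i] > 0.0:
                z = (center[i] - c) / weight[i]
            else:
                z = x[i]                       # all gradients so far are zero here
            x[i] = min(R[i], max(-R[i], z))    # exact for a 1-D quadratic on an interval
    return iterates


def regret_and_bound(gradients, R):
    """Regret against the best fixed point of the box, and 4 * sum_i R_i * Delta_{T,i}."""
    n = len(R)
    xs = ao_gd(gradients, R)
    loss = sum(sum(gi * xi for gi, xi in zip(g, x)) for g, x in zip(gradients, xs))
    totals = [sum(g[i] for g in gradients) for i in range(n)]
    best = -sum(R[i] * abs(totals[i]) for i in range(n))
    bound = 0.0
    for i in range(n):
        prev, acc = 0.0, 0.0
        for g in gradients:
            acc += (g[i] - prev) ** 2
            prev = g[i]
        bound += 4.0 * R[i] * math.sqrt(acc)
    return loss - best, bound


rng = random.Random(0)
R = [1.0, 2.0, 0.5]
grads = [[rng.gauss(0, 1) for _ in R] for _ in range(200)]
xs = ao_gd(grads, R)
regret, bound = regret_and_bound(grads, R)
```

When consecutive gradients are close, the quantities $\Delta_{T,i}$ stay small and so does the bound, which is the benefit of the optimistic prediction over a standard adaptive method.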


7 Proofs for Section 2.2.2 


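Corollary 3 (AO-EG). Let $\mathcal{K} = \Delta_n$ be the $n$-dimensional simplex and
$\varphi : x \mapsto \sum_{i=1}^{n} x_i \log(x_i)$ the negative entropy. Assume that
$\|g_t\| \le C$ for all $t$ and set

$$r_{0:t} = \sqrt{ 2 \, \frac{C + \sum_{s=1}^{t} \|g_s - \tilde g_s\|_\infty^2}{\log(n)} } \; (\varphi + \log(n)) .$$

Then, if we use the martingale-type gradient prediction $\tilde g_{t+1} = g_t$, the following regret
bound holds:

$$\mathrm{Reg}_T(\mathcal{A}, x) \le 2 \sqrt{ 2 \log(n) \left( C + \sum_{t=1}^{T-1} \|g_t - g_{t-1}\|_\infty^2 \right) } .$$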



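Proof. Since the negative entropy $\varphi$ is 1-strongly convex with respect to the $l_1$-norm,
$r_{0:t}$ is $\sqrt{2 \, \frac{C + \sum_{s=1}^{t} \|g_s - \tilde g_s\|_\infty^2}{\log(n)}}$-strongly
convex with respect to the same norm.

Applying Theorem 2 and using the fact that the dual of $l_1$ is $l_\infty$ along with
$\varphi \le 0$ yields a regret bound of:

$$\begin{aligned}
\mathrm{Reg}_T(\mathcal{A}, x)
&\le r_{0:T-1}(x) + \sum_{t=1}^{T} \|g_t - \tilde g_t\|_{(t-1),*}^2 \\
&\le \sqrt{ 2 \, \frac{C + \sum_{s=1}^{T-1} \|g_s - \tilde g_s\|_\infty^2}{\log(n)} } \, (\varphi + \log(n))
 + \sum_{t=1}^{T} \sqrt{ \frac{\log(n)}{2 \left( C + \sum_{s=1}^{t-1} \|g_s - \tilde g_s\|_\infty^2 \right)} } \, \|g_t - \tilde g_t\|_\infty^2 \\
&\le \sqrt{ 2 \left( C + \sum_{s=1}^{T-1} \|g_s - \tilde g_s\|_\infty^2 \right) \log(n) }
 + \sqrt{\frac{\log(n)}{2}} \sum_{t=1}^{T} \frac{\|g_t - \tilde g_t\|_\infty^2}{\sqrt{\sum_{s=1}^{t} \|g_s - \tilde g_s\|_\infty^2}} \\
&\le \sqrt{ 2 \left( C + \sum_{s=1}^{T-1} \|g_s - \tilde g_s\|_\infty^2 \right) \log(n) }
 + \sqrt{ 2 \log(n) \sum_{t=1}^{T} \|g_t - \tilde g_t\|_\infty^2 } \\
&\le 2 \sqrt{ 2 \left( C + \sum_{s=1}^{T-1} \|g_s - \tilde g_s\|_\infty^2 \right) \log(n) } .
\end{aligned}$$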


□ 
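
For the entropic regularizer the FTRL minimizer over the simplex has the familiar exponential-weights form, which the following sketch implements. It assumes the regularizer is $\lambda_t (\varphi + \log n)$ with $\lambda_t = \sqrt{2 (C + \sum_{s \le t} \|g_s - \tilde g_s\|_\infty^2) / \log n}$ and the prediction $\tilde g_{t+1} = g_t$, so that $x_{t+1,i} \propto \exp(-(g_{1:t,i} + \tilde g_{t+1,i}) / \lambda_t)$; the function name `ao_eg`, the constant `C = 1.0` and the toy gradient sequence are hypothetical.

```python
import math


def ao_eg(gradients, C):
    """Adaptive optimistic exponentiated-gradient iterates on the simplex.

    Requires n >= 2 coordinates and C > 0. Returns x_1, ..., x_T.
    """
    n = len(gradients[0])
    log_n = math.log(n)
    g_sum = [0.0] * n
    pred = [0.0] * n       # prediction for the first gradient is 0
    err = 0.0              # sum_s ||g_s - prediction_s||_inf^2
    x = [1.0 / n] * n      # x_1 is the uniform distribution
    iterates = []
    for g in gradients:
        iterates.append(list(x))
        err += max(abs(a - b) for a, b in zip(g, pred)) ** 2
        g_sum = [s + a for s, a in zip(g_sum, g)]
        pred = list(g)     # martingale-type prediction for the next round
        lam = math.sqrt(2.0 * (C + err) / log_n)
        scores = [-(s + p) / lam for s, p in zip(g_sum, pred)]
        top = max(scores)  # subtract the maximum for numerical stability
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        x = [wi / z for wi in w]
    return iterates


# Toy sequence: coordinate 0 always has the smallest loss.
toy = [[0.0, 1.0, 1.0]] * 50
toy_iterates = ao_eg(toy, C=1.0)
final = toy_iterates[-1]
```

Because the gradients in the toy sequence are constant, the prediction error is nonzero only in the first round, the scale $\lambda_t$ stops growing, and the iterates concentrate quickly on the best coordinate.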
8 Proofs for Section 3


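Theorem 5 (CAOS-FTRL-Prox). Let $\{r_t\}$ be a sequence of proximal non-negative functions,
such that $\operatorname{argmin}_{x \in \mathcal{K}} r_t(x) = x_t$, and let $\tilde g_t$ be the
learner's estimate of $\hat g_t$ given the history of noisy gradients $\hat g_1, \dots, \hat g_{t-1}$
and points $x_1, \dots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a sequence of non-negative
convex functions, such that $\psi_1(x_1) = 0$. Assume further that the function

$$h_{0:t}(x) = \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$$

is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update
$x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the following regret bounds:

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) - f_t(x) \right]
\le \mathbb{E}\left[ \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2 \right]$$

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \right]
\le \mathbb{E}\left[ r_{0:T}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t),*}^2 \right] .$$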


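Proof.

$$\begin{aligned}
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) - f_t(x) \right]
&\le \sum_{t=1}^{T} \mathbb{E}\left[ g_t^{\top} (x_t - x) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}\left[ \mathbb{E}[\hat g_t \mid \hat g_1, \dots, \hat g_{t-1}, x_1, \dots, x_t]^{\top} (x_t - x) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}\left[ \mathbb{E}[\hat g_t \cdot (x_t - x) \mid \hat g_1, \dots, \hat g_{t-1}, x_1, \dots, x_t] \right] \\
&= \sum_{t=1}^{T} \mathbb{E}\left[ \hat g_t \cdot (x_t - x) \right] .
\end{aligned}$$

This implies that upon taking an expectation, we can freely upper bound the difference
$f_t(x_t) - f_t(x)$ by the noisy linearized estimate $\hat g_t \cdot (x_t - x)$. After that, we can
apply Algorithm 2 on the gradient estimates to get the bounds:

$$\mathbb{E}\left[ \sum_{t=1}^{T} \hat g_t \cdot (x_t - x) \right]
\le \mathbb{E}\left[ \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2 \right]$$

$$\mathbb{E}\left[ \sum_{t=1}^{T} \hat g_t \cdot (x_t - x) + \psi_t(x_t) - \psi_t(x) \right]
\le \mathbb{E}\left[ r_{0:T}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t),*}^2 \right] .$$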


□ 


Theorem 6 (CAOS-FTRL-Gen). Let $\{r_t\}$ be a sequence of non-negative functions, and
let $\tilde g_t$ be the learner's estimate of $\hat g_t$ given the history of noisy gradients
$\hat g_1, \dots, \hat g_{t-1}$ and points $x_1, \dots, x_{t-1}$. Let $\{\psi_t\}_{t=1}^{\infty}$ be a
sequence of non-negative convex functions, such that $\psi_1(x_1) = 0$. Assume furthermore that
the function

$$h_{0:t}(x) = \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + \psi_{1:t+1}(x)$$

is 1-strongly convex with respect to some norm $\|\cdot\|_{(t)}$. Then, the update
$x_{t+1} = \operatorname{argmin}_x h_{0:t}(x)$ of Algorithm 3 yields the regret bounds:

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) - f_t(x) \right]
\le \mathbb{E}\left[ \psi_{1:T-1}(x) + r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2 \right]$$

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) + \psi_t(x_t) - f_t(x) - \psi_t(x) \right]
\le \mathbb{E}\left[ r_{0:T-1}(x) + \sum_{t=1}^{T} \|\hat g_t - \tilde g_t\|_{(t-1),*}^2 \right] .$$

Proof. The argument is the same as for Theorem 5, except that we now apply the bound of
Theorem 4 at the end. □


9 Proofs for Section 3.2.1 


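Theorem 7 (CAO-RCD). Assume $\mathcal{K} \subset \times_{i=1}^{n} [-R_i, R_i]$. Let $i_t$ be a
random variable sampled according to the distribution $p_t$, and let

$$\hat g_t = \frac{(g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}}, \qquad
\hat{\tilde g}_t = \frac{(\tilde g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}}$$

be the estimated gradient and estimated gradient prediction. Denote
$\Delta_{s,i} = \sqrt{\sum_{a=1}^{s} (\hat g_{a,i} - \hat{\tilde g}_{a,i})^2}$, and let

$$r_{0:t} = \sum_{i=1}^{n} \sum_{s=1}^{t} \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2$$

be the adaptive regularization. Then the regret of the resulting algorithm is bounded by:

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x) \right]
\le 4 \sum_{i=1}^{n} R_i \sqrt{ \sum_{t=1}^{T} \mathbb{E}\left[ \frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}} \right] } .$$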


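Proof. We can first compute that

$$\mathbb{E}[\hat g_t] = \mathbb{E}\left[ \frac{(g_t \cdot e_{i_t}) e_{i_t}}{p_{t,i_t}} \right]
= \sum_{i=1}^{n} \frac{(g_t \cdot e_i) e_i}{p_{t,i}} \, p_{t,i} = g_t ,$$

and similarly for the gradient prediction $\tilde g_t$.

Now, as in Corollary 2, the choice of regularization ensures us a regret bound of the
form:

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x) \right]
\le 4 \sum_{i=1}^{n} R_i \, \mathbb{E}\left[ \sqrt{ \sum_{t=1}^{T} (\hat g_{t,i} - \hat{\tilde g}_{t,i})^2 } \right] .$$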
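Algorithm 5 CAOS-Reg-ERM-Epoch-Mini-Batch

1: Input: scaling constant $\alpha > 0$, composite term $\psi$, $r_0 = 0$, partitions $\{\Pi_j\}_{j=1}^{l}$ of $\{f_j\}_{j=1}^{m}$.
2: Initialize: initial point $x_1 \in \mathcal{K}$, distribution $p_1$ over $\{1, \dots, l\}$.
3: Sample $j_1$ according to $p_1$, and set $t = 1$.
4: for $s = 1, \dots, k$: do
5:     Compute $\bar g_s^j = \nabla f_j(x_1)$ $\forall j \in \{1, \dots, m\}$.
6:     for $a = 1, \dots, T/k$: do
7:         If $T \bmod k = 0$, compute $g^j = \nabla f_j(x_t)$ $\forall j$.
8:         Set $\hat g_t = \frac{\sum_{j \in \Pi_{j_t}} g_t^j}{p_{t,j_t}}$, and construct $r_t \ge 0$.
9:         Sample $j_{t+1} \sim p_{t+1}$.
10:        Set $\tilde g_{t+1} = \frac{\sum_{j \in \Pi_{j_{t+1}}} \bar g_s^j}{p_{t,j_{t+1}}}$.
11:        Update $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{K}} \; \hat g_{1:t} \cdot x + \tilde g_{t+1} \cdot x + r_{0:t}(x) + (t+1)\alpha\psi(x)$ and $t = t + 1$.
12:    end for
13: end for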


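Moreover, we can compute that:

$$\begin{aligned}
\mathbb{E}\left[ \sqrt{ \sum_{t=1}^{T} (\hat g_{t,i} - \hat{\tilde g}_{t,i})^2 } \right]
&\le \sqrt{ \mathbb{E}\left[ \sum_{t=1}^{T} (\hat g_{t,i} - \hat{\tilde g}_{t,i})^2 \right] }
 \qquad \text{(by Jensen's inequality)} \\
&= \sqrt{ \sum_{t=1}^{T} \mathbb{E}\left[ \mathbb{E}_{i_t}\left[ (\hat g_{t,i} - \hat{\tilde g}_{t,i})^2 \right] \right] } \\
&= \sqrt{ \sum_{t=1}^{T} \mathbb{E}\left[ \frac{(g_{t,i} - \tilde g_{t,i})^2}{p_{t,i}} \right] } .
\end{aligned}$$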

□ 
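
The two expectation computations in this proof can be verified exactly by enumerating the sampled coordinate. The sketch below assumes the one-coordinate importance-weighted estimate $(g \cdot e_i) e_i / p_i$ for both the gradient and its prediction; the vectors `g`, `g_pred` and the distribution `p` are arbitrary illustrative values, and the helper names are hypothetical.

```python
def coordinate_estimate(g, p, i):
    """Importance-weighted one-coordinate estimate (g . e_i) e_i / p_i."""
    est = [0.0] * len(g)
    est[i] = g[i] / p[i]
    return est


def expectation(g, p):
    """Exact E_i[estimate] when coordinate i is drawn with probability p_i."""
    n = len(g)
    mean = [0.0] * n
    for i in range(n):
        est = coordinate_estimate(g, p, i)
        for k in range(n):
            mean[k] += p[i] * est[k]
    return mean


def second_moments(g, g_pred, p):
    """Exact E_i[(estimate_k - predicted_estimate_k)^2] for every coordinate k."""
    n = len(g)
    out = [0.0] * n
    for i in range(n):
        a = coordinate_estimate(g, p, i)
        b = coordinate_estimate(g_pred, p, i)
        for k in range(n):
            out[k] += p[i] * (a[k] - b[k]) ** 2
    return out


g = [0.5, -2.0, 1.5, 0.0]        # hypothetical gradient
g_pred = [0.4, -1.0, 1.5, 0.3]   # hypothetical gradient prediction
p = [0.1, 0.4, 0.3, 0.2]         # sampling distribution over coordinates

mean = expectation(g, p)
moments = second_moments(g, g_pred, p)
expected_moments = [(a - b) ** 2 / q for a, b, q in zip(g, g_pred, p)]
```

The estimate is unbiased for any distribution with full support, while its second moment scales like $1/p_i$, which is why the sampling distribution appears in the denominator of the regret bound.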


10 Further Discussion for Section 3.2.2 


We present here Algorithm 5, a mini-batch version of Algorithm 4, with an accompanying 
guarantee. 

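Corollary 7. Assume $\mathcal{K} \subset \times_{i=1}^{n} [-R_i, R_i]$. Let $\{\Pi_j\}_{j=1}^{l}$ be
a partition of the functions $f_i$, and let $e_{\Pi_j} = \sum_{i \in \Pi_j} e_i$. Denote
$\Delta_{s,i} = \sqrt{\sum_{a=1}^{s} (\hat g_{a,i} - \tilde g_{a,i})^2}$, and let
$r_{0:t} = \sum_{i=1}^{n} \sum_{s=1}^{t} \frac{\Delta_{s,i} - \Delta_{s-1,i}}{2R_i} (x_i - x_{s,i})^2$
be the adaptive regularization.

Then the regret of Algorithm 5 is bounded by:

$$\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) + \alpha\psi(x_t) - f_t(x) - \alpha\psi(x) \right]
\le \sum_{i=1}^{n} 4R_i \sqrt{ \sum_{s=1}^{k} \sum_{t=(s-1)(T/k)+1}^{(s-1)(T/k)+T/k} \sum_{a=1}^{l}
\frac{\left| \sum_{j \in \Pi_a} g^{j}_{t,i} - \bar g^{j}_{s,i} \right|^2}{p_{t,a}} }$$

Moreover, if $\|\nabla f_j\|_\infty \le L_j$ $\forall j$, then setting
$p_{t,j} = \frac{L_j}{\sum_{j'=1}^{l} L_{j'}}$
yields a worst-case bound of: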
A similar approach to Regularized ERM was developed independently by Zhao and Zhang
(2014). However, the approach presented here improves upon that algorithm by incorporating
adaptive regularization and optimistic gradient predictions, and it does not assume
higher regularity conditions such as strong convexity of the loss functions.